According to the Center for Disease Control and Prevention (CDC), cardiovascular or heart diseases account for 1 in every 4 deaths in the United States annually. The estimated costs to the nation’s economy amount to about $363 billion annually, which include hospitalizations, after services health care needs as well as lost productivity due to death and disability (SS Virani, 2021). The CDC also states that given the lifestyle habits of an average American, about half of the population (47%) are reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart diseases are vast and diverse and tend to vary between most racial and ethnic groups in the United States.
Therefore, there is an immediate need to detect the major risk factors early on, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors (compared to a baseline) and focus efforts on preventing the occurrence of developing heart diseases in the first place rather than minimizing second occurrences (Liau et al, 2010; Giampaoli et al, 2005 ).
The beginnings of the current understanding of the interconnected role of risk factors was first discussed in the Framingham Heart Study. Started in 1948, the study’s original goal was “ to identify common factors or characteristics that contribute to cardiovascular disease” (Hong 2018). However, today the FHS has morphed into a multigenerational study gathering genetic information that help analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung and Blood Institute). Jumping off from this, current research in the field has focused on using precision medicine algorithms along with computational development technology to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most of the studies have focused largely on Caucasian individuals as a sample and indicating that there is a gap in the usefulness of current risk assessment tools working for individuals of different races.
In this paper, we aim to understand the different risk factors that affect the occurrence of heart diseases among the US population, so that we may be able to see some of the intermingling between different risk factors.
The original data set can be retrieved from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) annual health survey. The survey interviews over four hundred thousand individuals in the USA asking questions regarding the risk factors of heart disease. Conducted in 2020 (published on February 15, 2021), the original data set has been cleaned from its original 300 variables to 18 variables by Kamil Pytlak and can be accessed on Kaggle .
# Lets load the appropriate libraries
library(tidyverse)
library(dplyr)
library(funModeling)
library(ezids)
library(ggplot2)
library(ggcorrplot)
library(corrr)
library(corrplot)
library(ROCR)
library(grid)
library(broom)
library(caret)
library(tidyr)
library(scales)
# ggthemr is not on CRAN. Need to install from console using
#devtools::install_github('cttobin/ggthemr')
#library(ggthemr)
#library(ggthemes)
library(gridExtra)
library(data.table)
# Load the Data
heart <-read.csv("heart_2020_cleaned.csv")
# Descriptive Statistics
status (heart)
From the table we can see that physical and mental health have a higher percentage of zeros. This means that just over 70% of the people in our data set report being physically sick in a 30 day period while about 60% of the respondents report feeling mentally unwell in that same time period.
We can also see that there are no N/A values in our data set. This is not surprising given that our data was cleaned before being retrieved.
Below are the plots showing the means of our numerical variables. We can see that, on average, people with heart disease have a slightly higher BMI than those who don’t have heart disease. However, the biggest difference is the number of days that a respondent feels physically unwell. Respondents who have heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in a 30 day period is slightly higher for people with heart disease, than for people without.
# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)
heart %>%
group_by(HeartDisease) %>%
summarise_at(vars("BMI",
"PhysicalHealth",
"MentalHealth",
"SleepTime"), mean)
## # A tibble: 2 × 5
## HeartDisease BMI PhysicalHealth MentalHealth SleepTime
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 No 28.2 2.96 3.83 7.09
## 2 Yes 29.4 7.81 4.64 7.14
# Changing Diabetes bordeline and pregnancy categories to Yes/No
heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"
# Changing all the categorical variables to numerical dummies
heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)
According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center in the Smidt Heart Institute, women experience heart disease more severely than men. Meaning that even though men are more likely to suffer heart attacks, women are more likely to die from a heart attack than a man.
We started off by trying to see which gender tends to have more heart disease based on symptoms reported just before diagnosis. From the table below, we can see that we have a lot more data for people without heart disease than with. However, even in the percentage table for those who do have heart disease, we see that there are more men with heart disease (about 5%) compared to women (3%). This is in line with our basic theory that men have more heart disease. However, we are then forced to ask why do men have more heart disease in our sample? Are they doing more or less of a certain activity than their female counterparts? Discussions of the relevant variables are provided below.
contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for heart Disease and Sex")
| Female | Male | |
|---|---|---|
| 0 | 156571 | 135851 |
| 1 | 11234 | 16139 |
addmargins(contable)
##
## Female Male Sum
## 0 156571 135851 292422
## 1 11234 16139 27373
## Sum 167805 151990 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)
print ("Contingency Table for Heart Disease and Sex in Percentage")
## [1] "Contingency Table for Heart Disease and Sex in Percentage"
print (percent)
##
## Female Male
## 0 48.96 42.48
## 1 3.51 5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)
In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.
hist(heart$BMI)
In the histogram, we can see that our data is approximately normal and given the Central Limit Theorem, if n>30, then we can assume normality for the sake of statistical tests.
After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.
# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male <- subset (heart, Sex == "Male")
# T-test of BMI between Genders
t.test(female$BMI, male$BMI)
##
## Welch Two Sample t-test
##
## data: female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.387 -0.299
## sample estimates:
## mean of x mean of y
## 28.2 28.5
The p-value is less than the 5% significance level, allowing us to reject the null hypothesis and conclude that the overall average BMI is different between the male and female populations. From the mean values, we see that men have a mean BMI of 28.50 while women have a BMI of 28.16. A difference of 0.34.
Next, we divided our data frame between men and women who do have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s T-test to look at the mean BMI difference.
# Subsetting genders based on respondents that have heart disease
female_HD<- subset(female, HeartDisease==1)
male_HD<- subset(male, HeartDisease== 1)
#T-test
t.test(female_HD$BMI, male_HD$BMI)
##
## Welch Two Sample t-test
##
## data: female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.208 0.121
## sample estimates:
## mean of x mean of y
## 29.4 29.4
From the T-test output above, we see that the p-value is 0.60, which is greater than 0.05. This means that we fail to reject the null and conclude that there is no significant difference in the average BMI between men and women who do have heart disease.
Earlier, we saw that respondents, who have heart disease, reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. The box plot below shows us that women with heart disease generally report felling unwell more often than men.
ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt physically ill?') +
xlab('Heart Disease') +
ylab('Physical Health')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.53 -1.97
## sample estimates:
## mean of x mean of y
## 6.88 9.14
The t-test also confirms what the box plot is showing. Since the p-value is less than 5%, that there is a difference in the number of days that men and women feel unwell. On average, women reported feeling sick about 9 days while men reported just over 6 days.
ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt mentally ill?') +
xlab('Heart Disease') +
ylab('Mental Health')+
scale_fill_brewer(palette = 'Set3', name = 'Sex')
t.test (male_HD$MentalHealth,
female_HD$MentalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.79 -2.34
## sample estimates:
## mean of x mean of y
## 3.59 6.16
The boxplot shows that women tend to report feeling mentally unwell more often than men.Since the p-value is less than 5%, indicating that there is a difference in the number of days that men and women feel mentally unwell. On average, women reported feeling mentally unwell about twice as many days as men. The cyclical nature of mental health and exercise has been covered extensively (Mikkelsen 2017; Raglin 1990); which assert that individuals struggling with mental health issues often are unable to complete physically demanding tasks, further making matters worse. This is because exercise has been shown to produce beneficial hormones that are associated with improvements in mental health.
ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle(' On average,how many hours of sleep do you get in a 24-hour period?') +
xlab('Heart Disease') +
ylab('Sleep Time')+
scale_fill_brewer(palette = 'Set', name = 'Gender')
t.test (male_HD$SleepTime,
female_HD$SleepTime)
##
## Welch Two Sample t-test
##
## data: male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0515 0.1389
## sample estimates:
## mean of x mean of y
## 7.18 7.08
Men and women have statistically different sleep times. Men on average get about 35 more minutes of sleep a week than women. This may not sound like much but sleep effects are known to be cumulative. A study conducted by Robbins et al. in 2021 “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk for viral infection “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al, 2021).
This lack of sleep could explain why we see that women in our data on average report feeling sick more often and could also be a contributing factors is why women are more likely to die from heart attacks than men (Noel Bairey Merz).
#Smoking:Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
#Smoking
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
xlab('Heart Disease') +
ylab('Smoking')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test ( male_HD$Smoking,
female_HD$Smoking)
#Drinking: How many drinks have you had this week?
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many drinks have you had this week?') +
xlab('Heart Disease') +
ylab('Alcohol Consumption')+
scale_fill_brewer(palette = 'Set3', name = 'Gender')
t.test ( male_HD$AlcoholDrinking,
female_HD$AlcoholDrinking)
Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week
HD <- subset(heart, HeartDisease==1)
freq(HD$AgeCategory)
## var frequency percentage cumulative_perc
## 1 80 or older 5449 19.91 19.9
## 2 70-74 4847 17.71 37.6
## 3 65-69 4101 14.98 52.6
## 4 75-79 4049 14.79 67.4
## 5 60-64 3327 12.15 79.5
## 6 55-59 2202 8.04 87.6
## 7 50-54 1383 5.05 92.6
## 8 45-49 744 2.72 95.3
## 9 40-44 486 1.78 97.1
## 10 35-39 296 1.08 98.2
## 11 30-34 226 0.83 99.0
## 12 25-29 133 0.49 99.5
## 13 18-24 130 0.47 100.0
freq(HD$Race)
## var frequency percentage cumulative_perc
## 1 White 22507 82.22 82.2
## 2 Black 1729 6.32 88.5
## 3 Hispanic 1443 5.27 93.8
## 4 Other 886 3.24 97.0
## 5 American Indian/Alaskan Native 542 1.98 99.0
## 6 Asian 266 0.97 100.0
freq(HD$GenHealth)
## var frequency percentage cumulative_perc
## 1 Good 9558 34.92 34.9
## 2 Fair 7084 25.88 60.8
## 3 Very good 5381 19.66 80.5
## 4 Poor 3850 14.06 94.5
## 5 Excellent 1500 5.48 100.0
When looking at the population make-up of respondents who do have heart disease, we see that a higher percentage of respondents are 60 and older, with 82% being of Caucasian descent. At the time of the interview they also said they felt their overall health is good. This may be a limiting factor for future research since, once again, the sample does not have many data points from non-Caucasian individuals. We also see that most of our respondents with heart disease are older above 60 years of age, this is in-line with theory.
Even though a logit regression seems more appropriate to estimate whether or not an individual gets heart disease, we can still use a linear regression in our exploratory stages to find the best fitting model.
reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
BIC(reg_1)
## [1] 90503
summary (reg_1)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2521 -0.1024 -0.0805 -0.0587 0.9677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.03e-03 2.29e-03 2.2 0.028 *
## SexMale 3.85e-02 9.87e-04 39.0 <2e-16 ***
## BMI 2.20e-03 7.76e-05 28.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared: 0.0074, Adjusted R-squared: 0.00739
## F-statistic: 1.19e+03 on 2 and 319792 DF, p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
BIC(reg_2)
## [1] 80963
summary (reg_2)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## SleepTime, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3585 -0.0918 -0.0726 -0.0425 1.0001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.18e-02 3.41e-03 -6.40 1.5e-10 ***
## SexMale 4.22e-02 9.78e-04 43.10 < 2e-16 ***
## BMI 1.43e-03 7.70e-05 18.55 < 2e-16 ***
## PhysicalHealth 6.18e-03 6.41e-05 96.37 < 2e-16 ***
## MentalHealth -4.95e-04 6.44e-05 -7.69 1.5e-14 ***
## SleepTime 3.95e-03 3.41e-04 11.58 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared: 0.0367, Adjusted R-squared: 0.0367
## F-statistic: 2.44e+03 on 5 and 319789 DF, p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
BIC(reg_3)
## [1] 60289
summary (reg_3)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race,
## data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4570 -0.1220 -0.0529 -0.0020 1.0648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.16e-02 5.25e-03 -11.74 < 2e-16 ***
## SexMale 4.88e-02 9.57e-04 50.95 < 2e-16 ***
## BMI 2.05e-03 7.62e-05 26.92 < 2e-16 ***
## PhysicalHealth 4.61e-03 6.32e-05 72.95 < 2e-16 ***
## MentalHealth 9.31e-04 6.39e-05 14.56 < 2e-16 ***
## Smoking 3.50e-02 9.92e-04 35.27 < 2e-16 ***
## AlcoholDrinking -2.25e-02 1.89e-03 -11.93 < 2e-16 ***
## SleepTime -1.39e-03 3.33e-04 -4.16 3.1e-05 ***
## AgeCategory25-29 -6.29e-03 2.75e-03 -2.29 0.0222 *
## AgeCategory30-34 -6.45e-03 2.69e-03 -2.40 0.0166 *
## AgeCategory35-39 -5.83e-03 2.64e-03 -2.21 0.0269 *
## AgeCategory40-44 1.14e-03 2.63e-03 0.43 0.6643
## AgeCategory45-49 1.10e-02 2.61e-03 4.20 2.6e-05 ***
## AgeCategory50-54 2.92e-02 2.52e-03 11.57 < 2e-16 ***
## AgeCategory55-59 4.63e-02 2.44e-03 18.96 < 2e-16 ***
## AgeCategory60-64 6.98e-02 2.39e-03 29.17 < 2e-16 ***
## AgeCategory65-69 9.48e-02 2.39e-03 39.60 < 2e-16 ***
## AgeCategory70-74 1.32e-01 2.44e-03 54.00 < 2e-16 ***
## AgeCategory75-79 1.65e-01 2.65e-03 62.18 < 2e-16 ***
## AgeCategory80 or older 2.08e-01 2.58e-03 80.59 < 2e-16 ***
## RaceAsian -2.45e-02 4.75e-03 -5.15 2.5e-07 ***
## RaceBlack -2.01e-02 4.09e-03 -4.91 9.1e-07 ***
## RaceHispanic -1.75e-02 4.03e-03 -4.35 1.4e-05 ***
## RaceOther -1.30e-02 4.48e-03 -2.90 0.0037 **
## RaceWhite -2.04e-02 3.73e-03 -5.48 4.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared: 0.0977, Adjusted R-squared: 0.0976
## F-statistic: 1.44e+03 on 24 and 319770 DF, p-value: <2e-16
With each new addition of a variable, we see that the adjusted R-squared is increasing indicating that the model fit is improving. However, the fit statistic is still quite low. Note that the 40-44 years age category has insignificant p-values. Since model 3 has the highest adjusted r-squared and the lowest BIC values, it will be our preferred model.
ezids::xkablevif(reg_3, wide=FALSE)
| V1 | |
|---|---|
| AgeCategory25-29 | 1.03 |
| AgeCategory30-34 | 1.06 |
| AgeCategory35-39 | 1.14 |
| AgeCategory40-44 | 1.17 |
| AgeCategory45-49 | 1.08 |
| AgeCategory50-54 | 1.02 |
| AgeCategory55-59 | 1.04 |
| AgeCategory60-64 | 1.72 |
| AgeCategory65-69 | 1.81 |
| AgeCategory70-74 | 1.89 |
| AgeCategory75-79 | 1.92 |
| AgeCategory80 or older | 1.95 |
| AlcoholDrinking | 2.10 |
| BMI | 2.28 |
| MentalHealth | 2.45 |
| PhysicalHealth | 2.47 |
| RaceAsian | 2.37 |
| RaceBlack | 1.99 |
| RaceHispanic | 2.10 |
| RaceOther | 2.51 |
| RaceWhite | 5.04 |
| SexMale | 5.77 |
| SleepTime | 3.00 |
| Smoking | 11.28 |
From, the VIF table, the VIF value for smoking is 11.28, this is undesirable and thus will be dropped. Therefore our preferred model is as follows:
HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory
We will train and test the model to see the cost analysis of our predictions.
In order to address the unbalanced classification, we subsetted our original data frame to create a new one (called total) that had equal proportions of the Heart Disease category. The new dataset, called total, has over 54000 observation that have been split into train and test sets.
# Creating a new dataframe (called total) with equal proportions of Heart Disease values
newdata1<- subset(heart, HeartDisease==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
summary(total)
#freq(total$HeartDisease)
# Convert to factor variables
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)
# Split into 80-20 train-test sets
set.seed(1234)
sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test <- total[-sample, ]
# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)
library(lemon)
knit_print.data.frame <- lemon_print
#Chi-squared test for model fit
anova(model, test = 'Chisq')
# Exponential Coefficients & Confidence Interval
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')
Looking at the anova chi-squared results, we can conclude that the logit model (HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) used is a good fit model for predicting the probability of getting heart disease. All the coefficient p-values in the model are also significant, except for the 25-39 age range, which is significant at the 10% level.
According to the model, the odds for getting heart disease increases by 2.11 for a man compared to a woman. While an increase in BMI increases the odds for heart disease by 1.03. Mental health, drinking, sleep and Physical health also affect your chances of getting heart disease but are not the most salient risk factors; they could be used as early indicators. As expected, the odds of getting heart disease for age increases as an individual gets older.
Also note, the confidence intervals for all the variables are narrow, indicating that they are good predictors of heart disease. However, the confidence intervals for age gets wider as we increase an individuals age. Although the uncertainty is greater at higher age ranges, there still may be enough precision to inform a patient about preventative measures, when considering their lifestyle choices.
library(ggthemes)
# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction <- predict( model, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) +
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) +
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) +
theme_economist()
# Positive instance = 1 = Yes Heart Disease
# Negative instance = 0 = No Heart Disease
After obtaining the predicted values that an individual will have heart disease on both the training and testing sets, a double density plot was drawn to show those predictions. Since we split our data evenly, the negative instances are skewed to the left while the positive instances are skewed to the right.
# user-defined different cost for false negative and false positive
source('unbalanced_functions.R')
cost.fp = 100
cost.fn = 200
roc_info <- ROCInfo( data = test,
predict = "prediction",
actual = "HeartDisease",
cost.fp = 100,
cost.fn = 200 )
grid.draw(roc_info$plot)
Cots for false negatives are higher than false positives.
loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
unloadPkg("regclass")
#
cm_info <- ConfusionMatrixInfo( data = test,
predict = "prediction",
actual = 'HeartDisease',
cutoff = .30 )
#ggthemr("flat")
plot(cm_info$plot)